Allennlp Predictor class that mimics gec_model #6

ksteimel · 2022-09-28T20:28:07Z

Sorry for the large PR. These changes introduce a gec_predictor class that mimics the behavior of gec_model while following the AllenNLP API.
It handles multi-iteration prediction. However, gec_model also has an option for ensemble prediction, which is not yet handled here.
I have also created a model archive that can be used with this new predictor. You can see the config file for it in test_fixtures/roberta_model. The weights file is downloaded to the appropriate location and shared between the predictor and gec_model. I can tar the folder up and add it to a public S3 bucket after this PR is approved.
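
For readers unfamiliar with multi-iteration prediction, here is a minimal sketch of the idea in plain Python. It is not the code from this PR: `iterative_correct`, `toy_pass`, and `max_iterations` are hypothetical names, and the toy pass stands in for a real model call. The loop re-runs a single correction pass on its own output until a pass changes nothing or the iteration budget runs out.

```python
from typing import Callable, List, Tuple


def iterative_correct(
    tokens: List[str],
    correct_once: Callable[[List[str]], List[str]],
    max_iterations: int = 5,
) -> Tuple[List[str], int]:
    """Apply `correct_once` repeatedly until it stops changing the tokens.

    Returns the final tokens and the number of passes that were run.
    """
    iterations = 0
    for _ in range(max_iterations):
        corrected = correct_once(tokens)
        iterations += 1
        if corrected == tokens:
            # A pass that changes nothing means we have converged.
            break
        tokens = corrected
    return tokens, iterations


def toy_pass(tokens: List[str]) -> List[str]:
    """Hypothetical stand-in for one model pass: fixes at most one error."""
    fixes = {"goed": "went", "teh": "the"}
    out = list(tokens)
    for i, tok in enumerate(out):
        if tok in fixes:
            out[i] = fixes[tok]
            break  # only one fix per pass, so several passes are needed
    return out


result, passes = iterative_correct("she goed to teh store".split(), toy_pass)
print(" ".join(result), passes)  # she went to the store 3
```

The third pass is the one that detects convergence, which is why two fixes take three passes in this toy setup.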


test_fixtures/roberta_model/vocabulary/d_tags.txt

ksteimel · 2022-09-29T17:03:54Z

Sorry! I forgot to add Nitin and Damien to the list of reviewers somehow.

desilinguist

I am not that familiar with this codebase so I am going to approve based on what I see and let @Frost45 and @damien2012eng do a more careful review.

damien2012eng · 2022-10-03T14:26:34Z

Great work, @ksteimel!
Just wondering: the intention of this PR is not to replace gec_model, but to introduce gec_predictor and the config file. Am I understanding that correctly?

Frost45

This is amazing work @ksteimel! Thank you for taking this on! Just a few minor comments.

gector/gec_model.py

gector/gec_predictor.py

gector/seq2labels_model.py

tests/test_gec_predictor.py


ksteimel added 19 commits September 27, 2022 17:19

Minor changes to make these tests pass if a cuda device is available.

9ba0aeb

Adding registered names for use by predictor

34386bd

Adding expected test output.

36ab359

Added WIP docstring to GecBERTModel

9bd6a7b

WIP Gec Predictor.

6f3f7ae

words metadata is getting filled if unspecified when text_to_instance is called.

9dd6bf1

Using JustSpacesWordSplitter so that tokenization matches that used by gec_model, get_token_action removed and placed in model.

64cbc8d

Decode now adds the corrected sentence to the output dict.

10e8069

There's a bug in here somewhere though because the output isn't exactly the same. I suspect there's an off-by-one error happening when the indices for edit operations are generated.

Updating gitignore to prevent adding .th files

f6a9185

Adding directory fixture as analogue to model archive

657f72b

Fixing errors in modeling code now that model.decode adds the original words and corrected words to the output.

e3cca59

Adding conditional so that no correction is performed in decode if no errors are identified. Fixing offsets in light of addition of START_TOKEN to front.

1c0025c

Appending start token when creating instances from json or string.

0f204ea

Start token is expected in output.

67ef511

Everything works now except that the text is getting lowercased somewhere.

Drop START_TOKEN from output_dict["words"]. This interferes with the offsets when creating the target (corrected sentence). Copying get_token_action method almost verbatim instead of using something more clever.

53a94dd

The outputs now no longer have $START_TOKEN in the corrected sentence. This is identical to the output for gec_model!

86212b7

Handling multiple iterations of correction in predictor now.

20a0692

Changed location of weights file so it can be used by gec_predictor as well.

0b88028

setup is now downloading weights file if it does not already exist.

0b6b508
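
Several of the commits above deal with offsets shifting once a start token is prepended to the input. The sketch below illustrates that alignment issue in isolation and is not the code from this PR: `apply_tags` is a hypothetical helper, the tag names (`$KEEP`, `$DELETE`, `$REPLACE_x`, `$APPEND_x`) are a simplified subset assumed for illustration, and the literal value of `START_TOKEN` is also an assumption. The key point is that the tag sequence is one longer than the word list, with tag 0 belonging to the start token, so indexing tags by word position without the +1 shift produces exactly the kind of off-by-one described.

```python
from typing import List

START_TOKEN = "$START"  # assumed value, used only in this sketch


def apply_tags(words: List[str], tags: List[str]) -> List[str]:
    """Apply one edit tag per token, where tags[0] belongs to the start token.

    `words` does not include the start token; `tags` must have
    len(words) + 1 entries.
    """
    tokens = [START_TOKEN] + words
    if len(tags) != len(tokens):
        raise ValueError(
            f"expected {len(tokens)} tags (including one for the start token), "
            f"got {len(tags)}"
        )
    out: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "$KEEP":
            out.append(token)
        elif tag == "$DELETE":
            continue
        elif tag.startswith("$REPLACE_"):
            out.append(tag[len("$REPLACE_"):])
        elif tag.startswith("$APPEND_"):
            # Appending after the start token is how a word is
            # inserted at the very front of the sentence.
            out.append(token)
            out.append(tag[len("$APPEND_"):])
        else:
            raise ValueError(f"unknown tag: {tag}")
    # The start token is bookkeeping only and should not reach the output.
    if out and out[0] == START_TOKEN:
        out = out[1:]
    return out


corrected = apply_tags(
    ["went", "to", "to", "store"],
    ["$APPEND_She", "$KEEP", "$KEEP", "$REPLACE_the", "$KEEP"],
)
print(" ".join(corrected))  # She went to the store
```

Raising on a length mismatch, rather than silently zipping, makes a missing start-token tag fail loudly instead of shifting every edit by one position.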

Frost45 reviewed Sep 28, 2022


test_fixtures/roberta_model/vocabulary/d_tags.txt

ksteimel requested review from damien2012eng and desilinguist September 29, 2022 17:02

desilinguist approved these changes Sep 29, 2022


ksteimel requested a review from mulhod October 5, 2022 14:58

Frost45 approved these changes Oct 5, 2022


damien2012eng approved these changes Oct 5, 2022


Frost45 self-requested a review October 6, 2022 18:27

damien2012eng self-requested a review October 11, 2022 18:54

ksteimel and others added 6 commits October 12, 2022 12:42

Apply suggestions from code review

6f8f23c

Co-authored-by: Sanjna Kashyap <[email protected]>

Removing unused imports, adding docstrings.

7427f93

Removing unused predictions to labeled_instances method.

73aef1b

Updated docstring for decode()

944993c

Removed unused imports.

709ba28

Adding back import of gec_predictor that shouldn't have been removed

b04376c

ksteimel merged commit c937981 into master Oct 24, 2022

Frost45 deleted the feature/predictor_from_gec_model branch October 25, 2022 12:23
